To this point, we haven’t really talked about how we should organize a repository. A repository is just a set of files to track over time, but how do we organize those files?
I am arguably not the best guide for this, as I am generally a disorganized person and it shows in my older repos.
On some level, this is to be expected with data analysis/data science - we rarely work in a linear progression with the same set of files from project to project.
But I’ve started to use a specific style for organization that seems to suit data science projects well.
It’s no secret that good analyses are often the result of very scattershot and serendipitous explorations. Tentative experiments and rapidly testing approaches that might not work out are all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression.
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging into extensive documentation. It also means that they don’t necessarily have to read 100% of the code before knowing where to look for very specific things.
It’s less important to have the perfect organization for a given project than it is to have some sort of standard that everyone understands and uses.
The goal is to organize projects in a way that makes it easier for others, and for your future self, to find and understand things.
Sadly, the Cookiecutter Data Science template is intended for Python, but we can adapt it for R easily enough. Let’s zoom in a bit on specific pieces.
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment.
│
├── src <- Source code for use in this project.
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results-oriented visualizations
│ └── visualize.py
Don’t ever edit your raw data, especially not manually, and especially not in Excel. Don’t overwrite your raw data. Don’t save multiple versions of the raw data. Treat the data (and its format) as immutable.
The code you write should move the raw data through a pipeline to your final analysis. You shouldn’t have to run all of the steps every time you want to make a new figure (see Analysis is a DAG), but anyone should be able to reproduce the final products with only the code in src and the data in data/raw.
Also, if data is immutable, it doesn’t need source control in the same way that code does. Therefore, by default, the data folder is included in the .gitignore file.
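Following that convention, the repo’s .gitignore might contain entries like these (a sketch; adjust to your project):

```
# raw and derived data stay out of version control
data/
# trained model artifacts can be large; regenerate them from src instead
models/
```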
Often in an analysis you have long-running steps that preprocess data or train models. If these steps have been run already (and you have stored the output somewhere like the data/interim directory), you don’t want to wait to rerun them every time. We prefer make for managing steps that depend on each other, especially the long-running ones.
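As a sketch of how make fits in (the file names are borrowed from the template above; the actual targets depend on your project):

```make
# Hypothetical Makefile rules; recipe lines must be indented with a tab.
# make reruns a rule only when a prerequisite is newer than its output.

data/processed/dataset.csv: src/data/make_dataset.py
	python src/data/make_dataset.py data/raw data/processed

models/model.pkl: src/models/train_model.py data/processed/dataset.csv
	python src/models/train_model.py data/processed models
```

Running `make models/model.pkl` rebuilds the processed data first, but only if it is out of date.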
It’s hard to describe exactly what a functional style is, but generally I think it means decomposing a big problem into smaller pieces, then solving each piece with a function or combination of functions. When using a functional style, you strive to decompose components of the problem into isolated functions that operate independently. Each function taken by itself is simple and straightforward to understand; complexity is handled by composing functions in various ways.
Cookiecutter Data Science doesn’t go into much detail on what your src code should look like, but I have found it naturally suits a functional programming style.
Rather than writing scripts that execute tasks, it’s generally better to write a series of functions that are then called and used in a pipeline.
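To sketch the idea, here is a toy pipeline in R built from small functions (the function names and cleaning rule are illustrative, not from the template):

```r
# A hypothetical pipeline built from small, independently testable functions.
load_data <- function(n) {
  # Simulate loading raw records
  data.frame(id = seq_len(n), value = seq_len(n) * 2)
}

clean_data <- function(df) {
  # Drop rows below a threshold, as a stand-in for real cleaning logic
  df[df$value > 4, ]
}

summarize_data <- function(df) {
  mean(df$value)
}

# The "pipeline" is just function composition
result <- summarize_data(clean_data(load_data(5)))
print(result)  # 8
```

Each function is trivial on its own; the analysis emerges from composing them, and each stage can be swapped out or tested in isolation.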
Another thing that Cookiecutter Data Science helps address: what should even be a repo? When we’re working on a project, how do we define and organize our code?
Do we create one repository for all of our data science projects? Do we create one repository per project?
This more or less becomes an argument between monorepos vs multi-repos.
A monorepo is one repository that contains code for a lot of different projects and tasks.
Imagine you have one big project you’re working on, containing a lot of separate pieces and code. The monorepo approach says: throw it all into the same repo.
This is opposed to the multi-repo approach, where aspects of a larger project are isolated and separated into individual repositories.
The Cookiecutter Data Science approach is much more conducive to the multi-repo approach:
It becomes a lot harder to define requirements and reproduce the environment to run code when you have a gigantic, monolithic repository.
But what if we want to reuse code across multiple repositories?
More on this later, but basically this is where submodules might come into play.
Or, just create another repo in the form of a package that can be used across multiple projects.
Given these principles, most of my repos end up being organized in the following way:
├── _targets <- stores the metadata and objects of your pipeline
├── renv <- information relating to your R packages and dependencies
├── data <- data sources used as an input into the pipeline
├── src <- functions used in project/targets pipeline
│ ├── data <- functions relating to loading and cleaning data
│ ├── models <- functions involved with training models
│ ├── reports <- functions used in generating tables and visualizations for reports
├── _targets.R <- script that runs the targets pipeline
├── renv.lock <- lockfile detailing project requirements and dependencies
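To make the layout concrete, a minimal `_targets.R` for this structure might look like the following (the target names, file paths, and helper functions like `clean()` are illustrative):

```r
library(targets)

# Load the project's functions from src/ (tar_source() defaults to R/,
# but accepts another directory)
tar_source("src")

# Define the pipeline as a list of targets; each target depends on the
# targets it references, so together they form a DAG.
list(
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(clean_df, clean(raw_data)),      # clean() lives in src/data
  tar_target(model, fit_model(clean_df)),     # fit_model() in src/models
  tar_target(figure, plot_results(model))     # plot_results() in src/reports
)
```

Calling `tar_make()` then runs (or skips, if up to date) each step, storing results and metadata in `_targets/`.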
Again, I’m not saying that this is THE OBJECTIVELY CORRECT WAY TO ORGANIZE AN R PROJECT. But it’s been a useful starting point for me in my work.
One of the key pillars to this organization is renv.
Let’s go back to the issues we had in running certain files in the starwars or board_games repos.
How often do you want to run someone else’s code, only to find that you need to install additional packages?
How often do you try to run someone else’s code only to discover that they’re using a deprecated function?
How many times have you gotten a headache because dplyr can’t make up its mind between mutate_if, mutate_at, mutate_all, and mutate(across())?
The renv package aims to solve most of these problems by helping you to create reproducible environments for R projects.
renv allows you to scan your project and find the packages it uses. This produces a list of packages with their current versions and dependencies. Using renv with a project adds three pieces to your repo: the renv folder (including the project library), the renv.lock lockfile, and a .Rprofile file.
This is the key magic that makes renv work: instead of having one library containing the packages used in every project, renv gives you a separate library for each project.
We then add (pieces of) the renv folder, renv.lock, and .Rprofile to our repository and commit them.
If we make a change to our code, we use renv to track whether that code has introduced, removed, or changed our dependencies. When we commit the change to our code, we will also commit a change to our renv.lock file.
In this way, using Git + renv allows us to store a history of how our project dependencies have changed with every commit.
So, how do we do this?
We will need to get to know a few functions from renv.
Adding renv to an existing project (e.g., the guns-data repo) involves a handful of functions:

- `renv::init()` — set up renv for the project, creating the renv folder, renv.lock, and .Rprofile
- `renv::dependencies()` — scan the project’s files for the packages they use
- `renv::status()` — report differences between the lockfile and the packages your project actually uses
- `renv::snapshot()` — record the current state of the project library in renv.lock
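These functions fit together in a typical session roughly like this (a sketch; these calls modify your project, so run them interactively rather than in a script):

```r
# One-time setup: create the project library, .Rprofile, and renv.lock
renv::init()

# Later, after adding or removing library() calls in your code:
renv::status()    # compare renv.lock against the packages your code uses
renv::snapshot()  # update renv.lock to match the current project library

# On a collaborator's machine (or a fresh clone of the repo):
renv::restore()   # install the exact package versions recorded in renv.lock
```

Committing renv.lock alongside each code change is what lets Git carry the history of your dependencies along with the history of your code.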